If we want to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.
To define temperature anomalies we need a reference, or base, period, which NASA states is 1951-1980.
We run the code below to load the file:
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv",
skip = 1,
na = "***")
Notice that, when using this function, we added two options: skip and na. The skip = 1 option is there because the real data table only starts in row 2, so we need to skip one row. The na = "***" option informs R how missing observations in the spreadsheet are coded. Looking at the spreadsheet, we can see that missing data are coded as "***". It is best to specify this here, as otherwise some of the data would not be recognized as numeric.
For each month and year, the data frame shows the deviation of temperature from the normal (expected). Furthermore, the data frame is in wide format.
We have two objectives in this section:
First is to select the year and the twelve month variables from the weather dataset. We do not need the others (J-D, D-N, DJF, etc.).
Second is to convert the data frame from wide to ‘long’ format.
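The code that performs this selection and pivot is not shown above; below is a minimal sketch, assuming the month columns in the NASA file are named Jan through Dec:

```r
# Sketch (assumed column names): keep Year and the twelve month columns,
# then pivot the month columns into long format, one row per month.
tidyweather <- weather %>%
  select(Year, Jan:Dec) %>%
  pivot_longer(cols = Jan:Dec,
               names_to = "Month",
               values_to = "delta")
```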
Let us plot the data using a time-series scatter plot and add a trend line. To do that, we first need to create a new variable called date to ensure that the delta values are plotted chronologically.
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), Month, "1")),
month = month(date, label=TRUE),
year = year(date))
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
theme_bw() +
labs (subtitle = "Weather Anomalies",
title = "Warmer Times",
caption = "Source: NASA",
x = "Year",
y = "Temperature Deviation")
Using facet_wrap() to produce a separate scatter plot for each month, again with a smoothing line, we can examine whether the effect of increasing temperature is more pronounced in some months. Based on the resulting charts, it is: the effect seems stronger in the generally colder months of October through April than in the generally warmer months of May through September.
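The faceting code itself is not reproduced above; a minimal sketch, assuming the tidyweather data frame with date, month, and delta columns:

```r
# Sketch: one panel per month, each with its own smoothing line
ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  facet_wrap(~month) +
  theme_bw() +
  labs(title = "Weather Anomalies by Month",
       x = "Year",
       y = "Temperature Deviation")
```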
It is sometimes useful to group time into different periods to study historical data. For example, we often use decades such as 1970s, 1980s, 1990s, etc. to refer to a period of time. The code below creates a new data frame called comparison that groups data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.
We remove data before 1881 using filter(). Then we use the mutate() function to create a new variable, interval, which records which period each observation belongs to. We assign the different periods using case_when().
comparison <- tidyweather %>%
filter(Year>= 1881) %>%
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"))
Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in.
ggplot(comparison, aes(x=delta, fill=interval))+
geom_density(alpha=0.2) +
theme_bw() +
labs (title = "Density Plot for Monthly Temperature Anomalies",
subtitle = "By Time Interval",
caption = "Source: NASA",
y = "Density",
x = "Delta",
fill = "Interval")
So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can compute these using group_by() and summarise(), followed by a scatter plot to display the result.
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>%
summarise(annual_average_delta = mean(delta, na.rm=TRUE))
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
geom_point()+
geom_smooth() +
theme_bw() +
labs (subtitle = "Average Yearly Anomaly",
title = "Warmer Times",
caption = "Source: NASA",
y = "Average Annual Delta")
NASA points out on their website that
> A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
Now we will construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the data frame comparison has already grouped temperature anomalies by time interval; we are only interested in the 2011-present period.
formula_ci <- comparison %>%
drop_na() %>%
filter(interval == "2011-present") %>%
group_by(Year) %>%
summarise(mean_delta = mean(delta),
SD_delta = sd(delta),
count_delta = n(),
t_critical = qt(0.975, count_delta-1),
SE_delta = sd(delta) / sqrt(n()),
margin_delta = t_critical * SE_delta,
delta_low = mean_delta - margin_delta,
delta_high = mean_delta + margin_delta)
formula_ci
## # A tibble: 9 x 9
## Year mean_delta SD_delta count_delta t_critical SE_delta margin_delta
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2011 0.7 0.103 12 2.20 0.0298 0.0655
## 2 2012 0.765 0.160 12 2.20 0.0462 0.102
## 3 2013 0.753 0.111 12 2.20 0.0321 0.0706
## 4 2014 0.9 0.140 12 2.20 0.0405 0.0891
## 5 2015 1.13 0.163 12 2.20 0.0470 0.103
## 6 2016 1.28 0.326 12 2.20 0.0940 0.207
## 7 2017 1.13 0.213 12 2.20 0.0616 0.136
## 8 2018 0.989 0.158 12 2.20 0.0455 0.100
## 9 2019 1.12 0.163 7 2.45 0.0616 0.151
## # … with 2 more variables: delta_low <dbl>, delta_high <dbl>
boot_delta <- comparison %>%
filter(interval == "2011-present") %>%
specify(response = delta) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
percentile_ci <- boot_delta %>%
get_confidence_interval(level = 0.95, type = "percentile")
visualize(boot_delta) +
shade_ci(endpoints = percentile_ci,fill = "khaki")+
labs(subtitle = "Bootstrap CI for Weather Delta",
title = "Bootstrapping Delta",
caption = "Source: NASA",
y = "Count",
x = "Average Annual Delta") +
theme_bw()
The bootstrap distribution above simulates the sampling distribution of the average temperature anomaly for the period 2011 to present. Because the period contains relatively few observations, any single-sample confidence interval will be fairly wide; bootstrapping resamples the data with replacement many times, which lets us approximate the sampling distribution of the mean directly. The beige area on the distribution above represents the 95% confidence interval based on the bootstrapped samples. The result is that we can say with 95% confidence that the true average delta for the period 2011 to present lies between roughly 0.92 and 1.02.
As we saw earlier, fivethirtyeight.com has detailed data on all polls that track the president’s approval ratings.
approval_polllist <- read_csv(here::here('data', 'approval_polllist.csv'))
glimpse(approval_polllist)
## Rows: 15,619
## Columns: 22
## $ president <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls…
## $ modeldate <chr> "9/27/2020", "9/27/2020", "9/27/2020", "9/27/2020…
## $ startdate <chr> "1/20/2017", "1/20/2017", "1/20/2017", "1/21/2017…
## $ enddate <chr> "1/22/2017", "1/22/2017", "1/24/2017", "1/23/2017…
## $ pollster <chr> "Gallup", "Morning Consult", "Ipsos", "Gallup", "…
## $ grade <chr> "B", "B/C", "B-", "B", "B-", "C+", "B+", "B", "C+…
## $ samplesize <dbl> 1500, 1992, 1632, 1500, 1651, 1500, 1190, 1500, 1…
## $ population <chr> "a", "rv", "a", "a", "a", "lv", "rv", "a", "lv", …
## $ weight <dbl> 0.262, 0.680, 0.153, 0.243, 0.142, 0.200, 1.514, …
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ approve <dbl> 45.0, 46.0, 42.1, 45.0, 42.3, 57.0, 36.0, 46.0, 5…
## $ disapprove <dbl> 45.0, 37.0, 45.2, 46.0, 45.8, 43.0, 44.0, 45.0, 4…
## $ adjusted_approve <dbl> 45.7, 45.3, 43.2, 45.7, 43.4, 51.5, 37.6, 46.7, 5…
## $ adjusted_disapprove <dbl> 43.6, 38.3, 43.9, 44.6, 44.5, 44.5, 42.8, 43.6, 4…
## $ multiversions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE,…
## $ url <chr> "http://www.gallup.com/poll/201617/gallup-daily-t…
## $ poll_id <dbl> 49253, 49249, 49426, 49262, 49425, 49266, 49260, …
## $ question_id <dbl> 77265, 77261, 77599, 77274, 77598, 77278, 77272, …
## $ createddate <chr> "1/23/2017", "1/23/2017", "3/1/2017", "1/24/2017"…
## $ timestamp <chr> "00:45:20 27 Sep 2020", "00:45:20 27 Sep 2020", "…
approval_polllist_fix <- approval_polllist %>%
mutate(
modeldate = mdy(modeldate),
startdate = mdy(startdate),
enddate = mdy(enddate),
createddate = mdy(createddate))
Using this data, we will calculate the average net approval rate (approve - disapprove) for each week since President Trump took office and create a plot that looks like this one.
Next we will compare the confidence intervals for week 15 (6-12 April 2020) and week 34 (17-23 August 2020).
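The code that produced the table below is not shown; one way to compute it, assuming weeks are derived from enddate with isoweek() and that a t-based interval is used:

```r
# Sketch (assumed variable names): weekly net approval with a 95%
# t-based confidence interval, for weeks 15 and 34 of 2020
weekly_ci <- approval_polllist_fix %>%
  mutate(net_approval = approve - disapprove,
         year = year(enddate),
         week = isoweek(enddate)) %>%
  filter(year == 2020, week %in% c(15, 34)) %>%
  group_by(year, week) %>%
  summarise(mean_approval = mean(net_approval),
            SD_approval = sd(net_approval),
            count_approval = n(),
            t_critical = qt(0.975, count_approval - 1),
            SE_approval = SD_approval / sqrt(count_approval),
            margin_approval = t_critical * SE_approval,
            approval_low = mean_approval - margin_approval,
            approval_high = mean_approval + margin_approval)
```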
## # A tibble: 2 x 10
## # Groups: year [1]
## year week mean_approval SD_approval count_approval t_critical SE_approval
## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 2020 15 -8.23 4.38 94 1.99 0.452
## 2 2020 34 -12.7 5.31 84 1.99 0.580
## # … with 3 more variables: margin_approval <dbl>, approval_low <dbl>,
## # approval_high <dbl>
We can see that the confidence interval for week 15 is slightly narrower than that for week 34, because week 34 has a higher standard deviation, which drives a higher standard error. This may reflect data released in mid-August, including a not-improving COVID-19 situation in the United States coupled with disappointing economic data. The wider confidence interval means a wider range of values over which we can be 95% confident the President’s true net approval lies.
Recall the gapminder data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. Here, we will join a few data frames with more data than the gapminder package. Specifically, we will look at HIV prevalence among adults aged 15-49, life expectancy, and four World Bank indicators: fertility rate (SP.DYN.TFRT.IN), primary school enrollment (SE.PRM.NENR), child mortality (SH.DYN.MORT), and GDP per capita (NY.GDP.PCAP.KD).
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")
library(wbstats)
worldbank_data <- wb_data(country="countries_only",
indicator = indicators,
start_date = 1960,
end_date = 2016)
countries <- wbstats::wb_cachelist$countries
We will join the three data frames (life_expectancy, worldbank_data, and hiv) into one. However, we first have to convert the data to tidy format, starting with the life_expectancy data. Next, we will tidy the HIV data and merge it with the life_expectancy data, before finally merging the result with worldbank_data.
The first plot shows the relationship between HIV prevalence in a population and life expectancy, grouped by region of the world.
Mutation_life_expectancy <- life_expectancy %>%
pivot_longer(cols = 2:300,
names_to = "year",
values_to = "life_expectancy") %>%
select(!2:3)
Mutation_HIV_prevalence <- hiv %>%
pivot_longer(cols = 2:34,
names_to = "year",
values_to = "HIV_prevalence")
Combined_data1 <- left_join(Mutation_life_expectancy, Mutation_HIV_prevalence, by = c("country","year"))
Combined_data2 <- left_join(Combined_data1, countries, by = "country")
Combined_data3 <- Combined_data2 %>%
select(!5:11) %>%
select(1:5)
Combined_hiv <- Combined_data3 %>%
filter(!HIV_prevalence %in% NA) %>%
filter(!region %in% NA)
ggplot(Combined_hiv, aes(x = HIV_prevalence, y = life_expectancy)) +
geom_point() +
geom_smooth(colour = "red", method = "lm") +
facet_wrap(~region) +
scale_x_log10() +
theme_bw() +
labs(subtitle = "Relationship Between HIV Prevalence and Life Expectancy",
title = "HIV Around the World",
x = "HIV Prevalence",
y = "Life Expectancy")
In general, there appears to be only a minor negative relationship between HIV prevalence and life expectancy. Because this data set covers the entire history of HIV, it reflects the impressive strides humanity has made in improving outcomes for patients living with the disease, even in the poorest and worst-affected regions of the world.
This next plot examines the relationship between fertility rate and GDP per-capita.
Combined_data4 <- left_join(worldbank_data, countries, by = "country")
Combined_gdp <- Combined_data4 %>%
filter(!NY.GDP.PCAP.KD %in% NA,
!SP.DYN.TFRT.IN %in% NA)
ggplot(Combined_gdp, aes(x = SP.DYN.TFRT.IN, y = NY.GDP.PCAP.KD)) +
geom_point() +
geom_smooth(colour="red", method = "lm") +
scale_y_log10(labels = dollar) +
facet_wrap(~region) +
theme_bw() +
labs(subtitle = "Fertility Rate and GDP per Capita",
title = "Who has Babies?",
caption = "Source: World Bank",
x = "Fertility Rate",
y = "GDP per Capita")
There is a very consistent negative relationship between fertility rate and GDP per capita. This matches conventional wisdom and observations around the world: richer countries tend to have fewer children per woman as women enter the workforce and raising children becomes more expensive.
Returning to HIV data, below is a bar chart of the number of missing observations for HIV data, grouped by region.
Combined_missing <- Combined_data3 %>%
filter(year >= 1979, is.na(HIV_prevalence), !region %in% NA) %>%
group_by(region) %>%
summarise(count = n())
ggplot(Combined_missing, aes(x = count,y = reorder(region, count))) +
geom_col() +
theme_bw() +
labs(subtitle = "Missing Data Observations",
title = "Data Deficiency",
x = "Count",
y = "")
Below are two sets of charts: one showing the countries that have made the largest improvement in child mortality by region, and the other showing those that have made the smallest improvement.
Combined_mortality_top <- Combined_data4 %>%
filter(!SH.DYN.MORT %in% NA) %>%
group_by(country, region) %>%
mutate(improvement = lag(SH.DYN.MORT) - SH.DYN.MORT) %>%
filter(!improvement %in% NA) %>%
summarize(improvement = sum(improvement)) %>%
group_by(region) %>%
slice_max(order_by = improvement, n = 5) %>%
ungroup %>%
mutate(region = as.factor(region),
country = reorder_within(country, improvement, region))
ggplot(Combined_mortality_top, aes(x = country, y = improvement)) +
geom_col() +
facet_wrap(~region, scales = "free") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous() +
theme_bw() +
labs(title = "Save the Children", subtitle = "Largest Child Mortality Rate Improvement", x = "", y = "Rate Improvement", caption = "Source: World Bank")
Combined_mortality_bottom <- Combined_data4 %>%
filter(!SH.DYN.MORT %in% NA) %>%
group_by(country, region) %>%
mutate(improvement = lag(SH.DYN.MORT) - SH.DYN.MORT) %>%
filter(!improvement %in% NA) %>%
summarize(improvement = sum(improvement)) %>%
group_by(region) %>%
slice_min(order_by = improvement, n = 5) %>%
ungroup %>%
mutate(region = as.factor(region),
country = reorder_within(country, improvement, region))
ggplot(Combined_mortality_bottom, aes(x = country, y = improvement)) +
geom_col() +
facet_wrap(~region, scales = "free") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous() +
theme_bw() +
labs(title = "Save the Children", subtitle = "Smallest Child Mortality Rate Improvement", x = "", y = "Rate Improvement", caption = "Source: World Bank")
Sub-Saharan Africa, the Middle East & North Africa, and South Asia appear to have seen the largest overall improvements in child mortality, while Europe has seen the smallest improvement, likely because it started from a much lower baseline mortality rate, leaving less room for improvement.
This final chart will examine the relationship between fertility rate and primary school enrollment.
Combined_primary <- Combined_data4 %>%
filter(!SE.PRM.NENR %in% NA, !SP.DYN.TFRT.IN %in% NA)
ggplot(Combined_primary, aes(x = SE.PRM.NENR, y = SP.DYN.TFRT.IN)) +
geom_point() +
geom_smooth(colour = "red", method = "lm") +
facet_wrap(~region) +
theme_bw() +
labs(subtitle = "Relationship between Fertility rate and Primary school enrollment",
title = "Educate the Children",
caption = "Source: World Bank",
x = "Primary School Enrollment",
y = "Fertility Rate")
Similar to the relationship between fertility rate and GDP per capita, there is a strong negative correlation between primary school enrollment and fertility rate. A possible explanation is that a country’s ability to provide widespread primary education is strongly tied to its wealth, so the link between GDP per capita and fertility carries over to this relationship as well.
Next we will revisit the CDC Covid-19 Case Surveillance Data.
url <- "https://data.cdc.gov/api/views/vbim-akqf/rows.csv?accessType=DOWNLOAD"
covid_data <- vroom::vroom(url)%>%
clean_names()
Given the data we have, we will produce two graphs that show the death rate (%), first by age group, sex, and whether the patient had co-morbidities.
covid_graph <- covid_data %>%
select(medcond_yn, death_yn, sex, age_group) %>%
filter(!medcond_yn %in% c("Missing", "Unknown", "Other", NA))%>%
filter(!sex %in% c("Missing", "Unknown", "Other", NA)) %>%
filter(!age_group %in% c("Missing", "Unknown", "Other", NA)) %>%
filter(!death_yn %in% c("Missing", "Unknown", "Other", NA)) %>%
mutate(death_yn = ifelse(death_yn == "Yes", 1, 0))%>%
mutate(medcond_yn = ifelse(medcond_yn == "Yes", "With comorbidities", "Without comorbidities")) %>%
group_by(age_group, sex, medcond_yn) %>%
summarise(death = mean(death_yn)) # proportion of deaths, since death_yn is 0/1
ggplot(covid_graph, aes(x = (death * 100), y = age_group)) +
geom_col(fill = "dark blue") +
facet_grid(rows = vars(medcond_yn), cols = vars(sex)) +
geom_text(aes(label = round(death * 100, 1)), position = position_dodge(width = 1), hjust = -0.1) +
labs(title = "Covid death % by age group, sex and presence of co-morbidities",
y = "",
x = "Percentage of Deaths in %",
caption = "Source: CDC") +
theme_bw()
And second, by age group, sex, and whether the patient was admitted to an Intensive Care Unit (ICU).
covid_graph2 <- covid_data %>%
select(icu_yn, death_yn, sex, age_group) %>%
filter(!icu_yn %in% c("Missing", "Unknown", "Other", NA))%>%
filter(!sex %in% c("Missing", "Unknown", "Other", NA)) %>%
filter(!age_group %in% c("Missing", "Unknown", "Other", NA)) %>%
filter(!death_yn %in% c("Missing", "Unknown", "Other", NA)) %>%
mutate(death_yn = ifelse(death_yn == "Yes", 1, 0))%>%
mutate(icu_yn = ifelse(icu_yn == "Yes", "Admitted to ICU", "Not admitted to ICU")) %>%
group_by(age_group, sex, icu_yn) %>%
summarise(death = mean(death_yn)) # proportion of deaths, since death_yn is 0/1
ggplot(covid_graph2, aes(x = (death * 100), y = age_group)) +
geom_col(fill = "dark orange") +
facet_grid(rows= vars(icu_yn), cols = vars(sex)) +
geom_text(aes(label = round(death * 100, 1)), position = position_dodge(width = 1), hjust = -0.1) +
labs(title = "Covid death % by age group, sex and ICU admission",
y = "",
x = "Percentage of Deaths in %",
caption = "Source: CDC") +
theme_bw()
Our charts will look similar to these ones.
Recall the TFL data on how many bikes were hired every single day. We can get the latest data by running the following.
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2020-09-18T09%3A06%3A54/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20201004%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201004T155121Z&X-Amz-Expires=300&X-Amz-Signature=473bb3ee1d5ea4d11dbfbfea79e594edbbe517035d268131dbc33ba131eb9493&X-Amz-SignedHeaders=host]
## Date: 2020-10-04 15:51
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 165 kB
## <ON DISK> /var/folders/5x/kd157hc16h91cxk4h0bt5bpm0000gn/T//RtmpiK9I1R/fileb7146ff6ade9.xlsx
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
week = isoweek(day))
We can easily create a facet grid that plots bikes hired by month and year. The distribution of bikes hired in May and June 2020 is drastically different from the distributions in previous years. This is clearly the result of the COVID-19 lockdowns imposed on London during those months.
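The facet-grid code itself is not shown; a minimal sketch, assuming density panels of daily hires by year and month:

```r
# Sketch: distribution of daily hires, faceted by year (rows) and month (columns)
ggplot(bike, aes(x = bikes_hired)) +
  geom_density() +
  facet_grid(rows = vars(year), cols = vars(month)) +
  theme_bw() +
  labs(title = "Distribution of Daily Bike Rentals",
       x = "Bikes Hired",
       y = "Density")
```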
However, our main challenge is reproducing two charts given below. The first plots the deviation in actual bike rentals from expected bike rentals.
The second looks at percentage changes from the expected level of weekly rentals. The two gray shaded rectangles correspond to the second (weeks 14-26) and fourth (weeks 40-52) quarters.
We used the median to calculate our expected rentals because it minimizes the effect of extreme values, such as the day a tube strike crippled London’s transport network. Using the mean would allow such a value to drag the expected number of bikes rented upward even though it is clearly an anomaly from the usual trend.
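The chart-building code is not reproduced here; one way to compute the expected level and the deviations from it, assuming the expected value is the 2015-2019 median described above:

```r
# Sketch (assumed approach): expected monthly rentals = median of daily hires
# over 2015-2019; deviation = monthly average minus that expected level
expected_rentals <- bike %>%
  filter(year %in% 2015:2019) %>%
  group_by(month) %>%
  summarise(expected = median(bikes_hired))

monthly_deviation <- bike %>%
  filter(year >= 2015) %>%
  group_by(year, month) %>%
  summarise(actual = mean(bikes_hired)) %>%
  left_join(expected_rentals, by = "month") %>%
  mutate(excess_rentals = actual - expected)
```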